In this tutorial, we start with simple R code and then introduce basic statistics and probability. Using R code and packages to learn statistical methods is the core of this webpage.

  • R

  • Probability and Statistics

  • Data with R

  • Statistics with R

  • Model with R


WHY?

Why is it important to learn statistics, and why do we use R for data analysis? In the current era characterized by big data, extensive information, and artificial intelligence (AI), statistical models form the cornerstone of AI and machine learning. Statistics provide a comprehensive framework for explaining outcomes, parameters, and nearly all aspects of these models. By utilizing R, a powerful statistical computing language, researchers can efficiently analyze and interpret complex datasets, thereby advancing the field of data science.


1 R

J.D. Long and P. Teetor (2019) present an R cookbook via bookdown. Start by installing R and its IDE, RStudio.


1.1 Getting Started

Apart from the nav bar, the RStudio window displays four panes. The upper-left pane shows the open file; I prefer using .Rmd (R Markdown) files. As in FIG 1, we write code inside a chunk and see the result below it. You can also type code after the > prompt in the lower-left console pane and press Enter to run it. The two right panes show variables, functions, and packages; we will introduce how they work later.

FIG 1

1.2 Variables

As for the question of using <- or =: both give the same result when setting variables, but my suggestion is <-.

x <- 100
x = 100
x
## [1] 100
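The practical difference between the two shows up inside function calls, where = names an argument while <- still performs assignment. A minimal sketch:

```r
# `=` here matches the argument `x` of mean(); no object `x` is created
mean(x = 1:10)
## [1] 5.5

# `<-` inside the call assigns as a side effect: `vals` exists afterwards
mean(vals <- 1:10)
## [1] 5.5
exists("vals")
## [1] TRUE
```

This side effect is one reason <- is preferred for assignment and = is reserved for arguments.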

1.2.1 integer

int <- 10
int
## [1] 10

1.2.2 character

chr <- "I am a character"
chr
## [1] "I am a character"

1.2.3 vector

Be careful with variable names; for example, c is the function for creating vectors.

V1 <- c(1:10)
V2 <- c("2", 2)
V1
##  [1]  1  2  3  4  5  6  7  8  9 10
V2
## [1] "2" "2"

1.2.4 function

\[ f(x)=x^2+2x+1\text{, and find } f(\pi). \]

f <- function(x){ x^2+2*x+1 }
f(pi)
## [1] 17.15279

Use ?function_name to see the details of a function.

1.2.5 factor

F1 <- factor(c(1,0,1,1,0,1))
F1
## [1] 1 0 1 1 0 1
## Levels: 0 1
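A common pitfall with factors is that as.numeric() returns the internal level codes rather than the original values; a short sketch:

```r
F1 <- factor(c(1,0,1,1,0,1))

# as.numeric() on a factor returns the level codes (1 for "0", 2 for "1")
as.numeric(F1)
## [1] 2 1 2 2 1 2

# convert via character to recover the original values
as.numeric(as.character(F1))
## [1] 1 0 1 1 0 1
```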

1.2.6 list

List1 <- list(integer1 = int,
              character1 = chr,
              vector1 = V1,
              function1 = f,
              factor1 = F1)

List1
## $integer1
## [1] 10
## 
## $character1
## [1] "I am a character"
## 
## $vector1
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $function1
## function(x){ x^2+2*x+1 }
## 
## $factor1
## [1] 1 0 1 1 0 1
## Levels: 0 1

1.2.7 matrix

M1 <- matrix(c(1,2,1,
               0,2,7),
             ncol = 3,nrow = 2,byrow = T)
M1
##      [,1] [,2] [,3]
## [1,]    1    2    1
## [2,]    0    2    7
M2 <- matrix(c(1,0,0,
               0,1,0,
               0,0,1), ncol = 3, nrow = 3, byrow = T)
M2
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
M1%*%M2
##      [,1] [,2] [,3]
## [1,]    1    2    1
## [2,]    0    2    7

1.2.8 data frame

df <- data.frame(Name = c("Sam", "Yuji", "Lan", "Elaine"),
                 Age = c(28,24,NA,NA),
                 Height = c(177, 187, 159, 164),
                 Gender = factor(c("M", "M", "F", "F")))
df
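Note that the Age column above contains NA (missing) values, which propagate through arithmetic unless removed explicitly; a small sketch of NA handling and row/column selection:

```r
df <- data.frame(Name = c("Sam", "Yuji", "Lan", "Elaine"),
                 Age = c(28, 24, NA, NA),
                 Height = c(177, 187, 159, 164))

# NA propagates through arithmetic unless removed with na.rm = TRUE
mean(df$Age)
## [1] NA
mean(df$Age, na.rm = TRUE)
## [1] 26

# select rows and columns with df[rows, cols]
df[df$Height > 170, "Name"]
## [1] "Sam"  "Yuji"
```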

1.3 Loop

Use a loop when you need to repeat the same task many times.

1.3.1 for

building <- c()

for(Floor in 1:101){
  building <- rbind(Floor, building)
}

building
##       [,1]
## Floor  101
## Floor  100
## Floor   99
## Floor   98
## Floor   97
## Floor   96
## Floor   95
## Floor   94
## Floor   93
## Floor   92
## Floor   91
## Floor   90
## Floor   89
## Floor   88
## Floor   87
## Floor   86
## Floor   85
## Floor   84
## Floor   83
## Floor   82
## Floor   81
## Floor   80
## Floor   79
## Floor   78
## Floor   77
## Floor   76
## Floor   75
## Floor   74
## Floor   73
## Floor   72
## Floor   71
## Floor   70
## Floor   69
## Floor   68
## Floor   67
## Floor   66
## Floor   65
## Floor   64
## Floor   63
## Floor   62
## Floor   61
## Floor   60
## Floor   59
## Floor   58
## Floor   57
## Floor   56
## Floor   55
## Floor   54
## Floor   53
## Floor   52
## Floor   51
## Floor   50
## Floor   49
## Floor   48
## Floor   47
## Floor   46
## Floor   45
## Floor   44
## Floor   43
## Floor   42
## Floor   41
## Floor   40
## Floor   39
## Floor   38
## Floor   37
## Floor   36
## Floor   35
## Floor   34
## Floor   33
## Floor   32
## Floor   31
## Floor   30
## Floor   29
## Floor   28
## Floor   27
## Floor   26
## Floor   25
## Floor   24
## Floor   23
## Floor   22
## Floor   21
## Floor   20
## Floor   19
## Floor   18
## Floor   17
## Floor   16
## Floor   15
## Floor   14
## Floor   13
## Floor   12
## Floor   11
## Floor   10
## Floor    9
## Floor    8
## Floor    7
## Floor    6
## Floor    5
## Floor    4
## Floor    3
## Floor    2
## Floor    1
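The loop above grows an object with rbind() on every iteration, which is slow in R. The same column of floors, 101 down to 1, can be built in one vectorized call; a minimal sketch:

```r
# one vectorized call replaces the rbind() loop
building <- matrix(101:1, ncol = 1)
head(building, 3)
##      [,1]
## [1,]  101
## [2,]  100
## [3,]   99
```

Vectorized code is usually both faster and easier to read than growing an object inside a loop.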

2 Probabilities and Statistics

Textbook


2.1 Probability Space

\[ \Large (\Omega, \mathscr{F}, P) \]

  • \(\Omega\): the sample space, the set of all possible outcomes.

  • \(\mathscr{F}\): a \(\sigma\)-algebra, the collection of events to which probabilities are assigned; at minimum \(\varnothing,\Omega\in\mathscr{F}\).

  • \(P\): the probability measure, which assigns each event a value between 0 and 1.


2.1.1 A simple example

The probability space for tossing a fair coin is:

  • \(\Omega = \{H,T\}\)

  • \(\mathscr{F} = \{\varnothing,\{H\},\{T\},\Omega\}\)

  • \(P(\{H\})=P(\{T\})=\frac{1}{2}\)

2.1.2 Probability Measure

  • Three Axioms

    • \(0\leq P(A)\leq 1,\;\forall A\)

    • \(P(\Omega) = 1\)

    • \(P(\bigcup_{i=1}^n A_i)=\sum_{i=1}^n P(A_i), \text{where }A_i\cap A_j=\varnothing,\;\forall i\neq j\)
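The axioms can be sanity-checked by simulation: the empirical frequencies of a fair coin are near 1/2, and their total over the whole sample space is exactly 1. A small sketch (the seed is arbitrary):

```r
set.seed(1)  # arbitrary seed for reproducibility
tosses <- sample(c("H", "T"), size = 10000, replace = TRUE)

# empirical probability of each outcome, close to 1/2
p_hat <- table(tosses) / length(tosses)
p_hat

# the empirical measure of the whole sample space is 1
sum(p_hat)
## [1] 1
```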


2.2 Discrete Random Variable

  • Use \(X,Y,Z,\dots\) to denote random variables and \(x,y,z,\dots\) to denote their values.

  • A random variable \(X\) takes one of the possible outcomes as its value.

  • Probability is denoted by \(P(X=x)=f(x)\).

  • Mean: \(E(X) = \sum_{i=1}^nx_if(x_i) = \mu\)

  • Variance: \(Var(X) = \sum_{i=1}^n(x_i-\mu)^2f(x_i)= E(X-E(X))^2=E(X^2)-(E(X))^2\)
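These formulas can be evaluated directly in R. A sketch for a fair six-sided die, where \(f(x)=1/6\) for \(x=1,\dots,6\):

```r
# fair six-sided die
x <- 1:6
f <- rep(1/6, 6)

mu <- sum(x * f)       # E(X)
mu
## [1] 3.5

sum((x - mu)^2 * f)    # Var(X) by the definition
## [1] 2.916667

sum(x^2 * f) - mu^2    # Var(X) by the shortcut E(X^2) - (E(X))^2
## [1] 2.916667
```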


2.2.1 Binomial

\[ \Large X\sim B(n,p) \]

  • \(n\) and \(p\) are the parameters of the distribution

  • \(n\): the number of trials

  • \(p\): the probability of the event in each trial

  • Probability Mass Function (pmf):

\[ f_X(x) = \binom{n}{x}\;p^xq^{n-x},\;\text{where }q=1-p,\;\forall x = 0, 1, \cdots, n \]
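The pmf formula can be checked against R's built-in dbinom(); a sketch for arbitrarily chosen values \(n=10\), \(p=0.3\), \(x=3\):

```r
# pmf of B(n = 10, p = 0.3) at x = 3, by the formula and by dbinom()
n <- 10; p <- 0.3; x <- 3

choose(n, x) * p^x * (1 - p)^(n - x)
## [1] 0.2668279

dbinom(x, size = n, prob = p)
## [1] 0.2668279
```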


2.3 Continuous Random Variable

  • Probability is described by \(P(X\leq x)=\int_{-\infty}^xf_X(t)dt\); consequently \(P(X=x)=0\)

  • Mean: \(E(X) = \int_{\mathbb{R}}xf_X(x)dx = \mu\)

  • Variance: \(Var(X) = \int_{\mathbb{R}}(x-\mu)^2f_X(x)dx = E(X-E(X))^2=E(X^2)-(E(X))^2\)


2.3.1 Exponential

  • \(X\sim\text{Exp}(\lambda)\)

  • Probability Density Function (pdf): \(f_X(x) = \lambda e^{-\lambda x},\;x\geq 0\)

f <- function(x, lambda = 1){lambda*exp(-lambda*x)}
curve(f,from = 0, to = 5, col = 1)
curve(f(x,2),from = 0, to = 5, col = 2, add = TRUE)
curve(f(x,.5),from = 0, to = 5, col = 3, add = TRUE)


2.3.2 Normal

\[ \Large X\sim N(\mu, \sigma^2) \]

\[ f_X(x) = \cfrac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}},\;\forall\;x\in\mathbb{R} \]
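As with the binomial pmf, the normal density formula can be checked against R's built-in dnorm(); a sketch at the arbitrarily chosen point \(x=1\) for the standard normal \(N(0,1)\):

```r
# density of N(mu = 0, sigma^2 = 1) at x = 1, by the formula and by dnorm()
mu <- 0; sigma <- 1; x <- 1

1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
## [1] 0.2419707

dnorm(x, mean = mu, sd = sigma)
## [1] 0.2419707
```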


2.3.3 t-distribution

  • \(X\sim t_n\)

  • \(t_n\overset{d}{\rightarrow}\color{red}{N(0,1)}\;\) as \(\;n\rightarrow\infty\)

curve(dt(x,df=1),from=-4,to=4,col="#000000",ylim=c(0,0.4))
curve(dt(x,df=2),from=-4,to=4,col="#240000",add=T)
curve(dt(x,df=3),from=-4,to=4,col="#460000",add=T)
curve(dt(x,df=4),from=-4,to=4,col="#690000",add=T)
curve(dt(x,df=5),from=-4,to=4,col="#800000",add=T)
curve(dt(x,df=8),from=-4,to=4,col="#990000",add=T)
curve(dt(x,df=10),from=-4,to=4,col="#bb0000",add=T)
curve(dt(x,df=50),from=-4,to=4,col="#cc0000",add=T)
curve(dt(x,df=91),from=-4,to=4,col="#dd0000",add=T)
curve(dt(x,df=100),from=-4,to=4,col="#ee0000",add=T)
curve(dnorm(x,0,1),from=-4,to=4,col="#ff0000",add=T)


3 Data with R


3.1 Random Sample

\[ \mathcal{X}=(X_1, X_2,\cdots, X_{100}),\;\forall\;X_i\overset{iid}{\sim} N(160,10) \]

set.seed(19970608)
#random
x <- rnorm(n = 100, mean = 160, sd = sqrt(10))
sort(x, decreasing = FALSE)
##   [1] 150.2045 150.3638 151.9515 152.4867 153.2311 153.9550 155.0583 155.1287
##   [9] 155.4500 155.8637 156.0542 156.3821 156.4040 156.6807 156.8222 156.9162
##  [17] 156.9413 157.3930 157.4179 157.4487 157.4901 157.5814 157.6270 157.8093
##  [25] 158.1321 158.1779 158.1841 158.2011 158.3338 158.4286 158.5381 158.5747
##  [33] 158.6128 158.6130 158.6211 158.7004 158.8500 158.9164 158.9294 158.9382
##  [41] 158.9697 159.0128 159.0373 159.3363 159.4597 159.5536 159.6015 159.7461
##  [49] 159.8022 159.8833 159.8919 159.9494 159.9613 159.9639 160.1277 160.2413
##  [57] 160.2432 160.2516 160.2878 160.4307 160.5644 160.5740 160.7001 160.8122
##  [65] 160.8769 160.8916 160.9366 160.9986 161.0055 161.0284 161.0291 161.2953
##  [73] 161.3876 161.4904 161.5726 161.6253 161.6258 161.6970 161.7136 161.8119
##  [81] 162.3868 162.4849 162.5548 162.6081 163.2007 163.2842 163.4371 163.7381
##  [89] 163.8430 163.8762 164.2689 164.4121 164.4778 165.3108 165.5170 165.9530
##  [97] 166.4605 167.1191 167.4459 168.5680

3.2 Histogram

hist(x, nclass = 20)


3.3 Data table

library(kableExtra)
df <- read.csv(file = "csv/df.csv", fileEncoding = "Big5")
kable(df)
Name           Height  Weight  Gender  Age
Sam Lu         177     82      M       28
Lan Huang      159     45      F       21
Yuji Nishida   187     89      M       24
Elaine Liao    164     53      F       27

3.4 New Data (Male)

library(randomNames)
set.seed(19970608)
New_df.M <- data.frame(Name = randomNames(n = 68, gender = 0, which.names = "first"),
                       Height = rnorm(68, 175, 10),
                       Weight = rnorm(68, 70, 10),
                       Gender = rep("M", 68),
                       Age = sample(20:40, size = 68, replace = TRUE))
New_df.M

3.5 New Data (Female)

library(randomNames)
set.seed(19970607)
New_df.F <- data.frame(Name = randomNames(n = 28, gender = 1, which.names = "first"),
                       Height = rnorm(28, 160, 10),
                       Weight = rnorm(28, 50, 10),
                       Gender = rep("F", 28),
                       Age = sample(20:40, size = 28, replace = TRUE))
New_df.F

3.6 New Data (n = 100)

New_df <- rbind(df, New_df.M, New_df.F) #cbind()
New_df

3.7 Data editing

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
New_df <- New_df%>%
  mutate(BMI = Weight/(Height/100)^2)%>%
  transform(Gender = as.factor(Gender),
            BMI = round(BMI, digits = 1))
New_df

3.8 Data Information

summary(New_df)
##      Name               Height          Weight      Gender      Age       
##  Length:100         Min.   :141.8   Min.   :33.26   F:30   Min.   :20.00  
##  Class :character   1st Qu.:164.2   1st Qu.:53.76   M:70   1st Qu.:25.00  
##  Mode  :character   Median :171.9   Median :62.46          Median :31.00  
##                     Mean   :171.1   Mean   :62.89          Mean   :30.21  
##                     3rd Qu.:178.5   3rd Qu.:72.78          3rd Qu.:35.00  
##                     Max.   :198.8   Max.   :96.98          Max.   :40.00  
##       BMI       
##  Min.   :10.50  
##  1st Qu.:18.38  
##  Median :21.40  
##  Mean   :21.48  
##  3rd Qu.:24.93  
##  Max.   :33.70

3.9 Data searching

New_df%>%filter(BMI<13)
New_df%>%filter(Height>190)
New_df%>%filter(Age==28)

4 Statistics with R


4.1 One-sample Student’s t Test

\[ H_0: \mu=170 \]

\[ H_1: \mu\neq 170 \]

t.test(New_df$Height, mu=170)
## 
##  One Sample t-test
## 
## data:  New_df$Height
## t = 0.9556, df = 99, p-value = 0.3416
## alternative hypothesis: true mean is not equal to 170
## 95 percent confidence interval:
##  168.8563 173.2688
## sample estimates:
## mean of x 
##  171.0625
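The t statistic reported above can also be computed by hand as \(t=(\bar{x}-\mu_0)/(s/\sqrt{n})\). A sketch using a simulated sample (the document's New_df is not reproduced here, so the heights below are an assumed stand-in):

```r
set.seed(1)  # simulated heights standing in for New_df$Height
h <- rnorm(100, mean = 171, sd = 10)
mu0 <- 170

# t statistic by hand: (sample mean - mu0) / standard error
tstat <- (mean(h) - mu0) / (sd(h) / sqrt(length(h)))
tstat

# matches the statistic reported by t.test()
all.equal(tstat, unname(t.test(h, mu = mu0)$statistic))
## [1] TRUE
```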

4.2 One-sample Student’s t Test (one-sided)

\[ H_0: \mu\leq 170 \]

\[ H_1: \mu> 170 \]

t.test(New_df$Height, mu=170, alternative = "greater") #alternative = "less", when H_1:\mu<170
## 
##  One Sample t-test
## 
## data:  New_df$Height
## t = 0.9556, df = 99, p-value = 0.1708
## alternative hypothesis: true mean is greater than 170
## 95 percent confidence interval:
##  169.2163      Inf
## sample estimates:
## mean of x 
##  171.0625

4.3 Two-sample Student’s t Test

\[ H_0: \mu_1\geq\mu_2 \]

\[ H_1: \mu_1<\mu_2 \]

BMI_M <- New_df%>%filter(Gender=="M")
BMI_M <- sample(BMI_M$BMI, 30, replace = F)
BMI_F <- New_df%>%filter(Gender=="F")
BMI_F <- BMI_F$BMI 

#test
t.test(BMI_M, BMI_F, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  BMI_M and BMI_F
## t = 3.7572, df = 57.764, p-value = 0.9998
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 5.818209
## sample estimates:
## mean of x mean of y 
##  22.17333  18.14667

5 Model with R


5.1 Linear Model

\[ Y = \beta_0+\beta_1X+\varepsilon\;,\;\text{where }\;\varepsilon\sim N(0,\sigma^2) \]

glm1 <- glm(Height~Weight, data = New_df)
summary(glm1)
## 
## Call:
## glm(formula = Height ~ Weight, data = New_df)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 150.72221    4.57094  32.974  < 2e-16 ***
## Weight        0.32341    0.07086   4.564 1.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 103.0028)
## 
##     Null deviance: 12240  on 99  degrees of freedom
## Residual deviance: 10094  on 98  degrees of freedom
## AIC: 751.24
## 
## Number of Fisher Scoring iterations: 2

5.2 Regression Plot

library(plotly)


est_beta <- glm1$coefficients
New_est <- data.frame(Height = est_beta[1]+est_beta[2]*New_df$Weight,
                      Weight = New_df$Weight)

plot_ly(New_df, x = ~Weight,
        y = ~Height,
        text = New_df$Name,
        type = 'scatter',
        alpha = 0.65,
        mode = 'markers',
        name = 'Weight-Height')%>%
  add_trace(data = New_est,
            x = ~Weight,
            y = ~Height,
            name = 'Regression Fit',
            mode = 'lines')

5.3 Regression Plot (Gender)

library(plotly)

est_beta <- glm1$coefficients
New_est <- data.frame(Height = est_beta[1]+est_beta[2]*New_df$Weight,
                      Weight = New_df$Weight)

plot_ly(New_df,
        x = ~Weight,
        y = ~Height,
        text = New_df$Name,
        color = ~New_df$Gender,
        type = 'scatter',
        alpha = 0.65, mode = 'markers',
        name = 'Weight-Height')%>%
  add_trace(data = New_est,
            x = ~Weight,
            y = ~Height,
            name = 'Regression Fit',
            mode = 'lines')

5.4 Regression Plot (Age)

library(plotly)

est_beta <- glm1$coefficients
New_est <- data.frame(Height = est_beta[1]+est_beta[2]*New_df$Weight,
                      Weight = New_df$Weight)

plot_ly(New_df, x = ~Weight,
        y = ~Height,
        text = New_df$Name,
        color = ~New_df$Age,
        type = 'scatter',
        alpha = 0.65,
        mode = 'markers',
        name = 'Weight-Height')%>%
  add_trace(data = New_est,
            x = ~Weight,
            y = ~Height,
            name = 'Regression Fit',
            mode = 'lines')

5.5 Statistical meaning

  • Model: \(Y=X\beta+\varepsilon\), for \(\varepsilon\sim N(0,\sigma^2 I)\)

  • Estimate \(\beta_0\), \(\beta_1\)

\[ Y = \begin{pmatrix} Y_1\\ \vdots\\ Y_n \end{pmatrix} ,\; X = \begin{pmatrix} 1 & X_1\\ \vdots & \vdots\\ 1 & X_n \end{pmatrix} ,\;\text{and }\; \beta = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \]

  • \(\hat{\beta} = \begin{pmatrix} \hat{\beta_0}\\ \hat{\beta_1} \end{pmatrix}\) is LSE of \(\beta\)

5.5.1 LSE

  • LSE (Least Squares Estimation)

  • Estimation of \(Y\): \(\hat{Y}=X\hat{\beta}\)

  • Solve \(\hat{\beta}\) by minimizing \(\|Y-X\hat{\beta}\|^2\), i.e., from the normal equations \(X^T(Y-X\hat{\beta})=0\)

  • We have:

\[ \hat{\beta} = (X^TX)^{-1}X^TY \]
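The closed-form estimator above can be computed directly with matrix algebra and compared with R's fitted coefficients. A sketch with simulated (Weight, Height) pairs standing in for New_df (the true coefficients 150 and 0.3 below are assumed for illustration):

```r
set.seed(1)  # simulated data standing in for New_df
Weight <- rnorm(100, mean = 63, sd = 10)
Height <- 150 + 0.3 * Weight + rnorm(100, mean = 0, sd = 10)

# design matrix with an intercept column of 1s
X <- cbind(1, Weight)

# beta_hat = (X'X)^{-1} X'Y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Height
beta_hat

# matches the coefficients fitted by lm()
all.equal(as.vector(beta_hat), unname(coef(lm(Height ~ Weight))))
## [1] TRUE
```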